# Multimodal Large Model

INFRL Qwen2.5 VL 72B Preview Ggufs Fully Quantized
Apache-2.0
An improved vision-language model based on Qwen2.5-VL-72B-Instruct that excels on multiple visual reasoning benchmarks.
Text-to-Image English
GeorgyGUF
230
0
Finetune VQA 1B
Apache-2.0
A visual question answering model fine-tuned from InternVL3-1B and Vintern-1B-v3_5. It supports Vietnamese and is suited to image content understanding and question-answering tasks.
Text-to-Image Other
TienAnh
20
0
Emova Qwen 2 5 3b
Apache-2.0
EMOVA is an end-to-end omni-modal large language model that supports visual, auditory, and speech functions and can generate text and speech responses with emotional control.
Multimodal Fusion Transformers Supports Multiple Languages
Emova-ollm
25
2
Internvl3 1B Hf
Other
InternVL3 is an advanced series of multimodal large language models with strong multimodal perception and reasoning capabilities. It supports image, video, and text inputs.
Image-to-Text Transformers Other
OpenGVLab
1,844
2
Internvl3 78B Pretrained
Other
InternVL3-78B is an advanced multimodal large language model developed by OpenGVLab with strong overall performance. Compared to its predecessor, InternVL 2.5, it has stronger multimodal perception and reasoning capabilities and extends to new domains such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.
Text-to-Image Transformers Other
OpenGVLab
22
1
Qwen2.5 Omni 7B GPTQ 4bit
MIT
A 4-bit GPTQ quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
Multimodal Fusion Safetensors Supports Multiple Languages
FunAGI
3,957
51
Internvl 2 5 HiCo R16
Apache-2.0
InternVideo2.5 is a video multimodal large language model (MLLM) enhanced by long and rich context (LRC) modeling, built upon InternVL2.5.
Text-to-Video Transformers English
FriendliAI
129
1
Internvideo2 5 Chat 8B
Apache-2.0
InternVideo2.5 is a video multimodal large language model enhanced by long and rich context (LRC) modeling, built upon InternVL2.5. It significantly improves on existing MLLMs by strengthening their ability to perceive fine-grained details and capture long-term temporal structure.
Video-to-Text Transformers English
OpenGVLab
8,265
60
Mplug Owl3 7B 241101
Apache-2.0
mPLUG-Owl3 is an advanced multimodal large language model focused on understanding long image sequences. Its hyper attention mechanism significantly improves processing speed and the sequence length it can support.
Text-to-Image Safetensors English
mPLUG
302
10
Llm Jp 3 Vila 14b
A large-scale vision-language model developed by Japan's National Institute of Informatics, supporting Japanese and English with strong image understanding and text generation capabilities.
Image-to-Text Japanese
llm-jp
106
10
Pixtral 12B Captioner Relaxed
Apache-2.0
An instruction-tuned version of the Pixtral-12B-2409 multimodal large language model that generates more richly detailed descriptions of given images.
Image-to-Text Transformers English
Ertugrul
79
24
Docowl2
Apache-2.0
mPLUG-DocOwl2 is an OCR-free multimodal large language model for multi-page document understanding, efficiently encoding document content via a high-resolution document compressor.
Image-to-Text Safetensors English
mPLUG
482
99
Chartmoe
Apache-2.0
ChartMoE is a multimodal large language model based on InternLM-XComposer2 that uses a mixture-of-experts connector and offers advanced chart capabilities.
Image-to-Text Transformers
IDEA-FinAI
250
12
Kangaroo
Apache-2.0
Kangaroo is a multimodal large language model designed specifically for long video understanding. It supports bilingual (Chinese-English) dialogue and long video inputs.
Video-to-Text Transformers Supports Multiple Languages
KangarooGroup
163
12
Internlm Xcomposer2 Vl 7b
Other
InternLM-XComposer2 is a vision-language large model built on InternLM2, with strong image-text understanding and creation capabilities.
Text-to-Image Transformers
internlm
1,902
82